MONGOCRYPT-755 Implement StrEncode #928
Nice work. Only substantial comment is to limit string lengths to prevent possible overflows.
fprintf(stderr,
        "Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
        str,
        lb,
        ub,
        unfolded_codepoint_len);
Suggest using the TEST_PRINTF and TEST_STDERR_PRINTF macros to flush stdout/stderr and avoid mixed output in Evergreen logs. The macros were recently introduced in b193dba. Run git merge master to include them.
TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
str,
lb,
ub,
unfolded_codepoint_len);
Changed to TEST_PRINTF's, thanks for this note.
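For readers without the referenced commit at hand, a flushing print macro of this kind might look roughly like the sketch below. The SKETCH_ names and the helper function are hypothetical stand-ins; the project's actual TEST_PRINTF and TEST_STDERR_PRINTF definitions may differ.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the macros named in the review. Each one
 * flushes the other stream first, writes, then flushes its own stream,
 * so stdout and stderr lines keep their relative order in a shared log. */
#define SKETCH_TEST_PRINTF(...)       \
    do {                              \
        fflush(stderr);               \
        fprintf(stdout, __VA_ARGS__); \
        fflush(stdout);               \
    } while (0)

#define SKETCH_TEST_STDERR_PRINTF(...) \
    do {                               \
        fflush(stdout);                \
        fprintf(stderr, __VA_ARGS__);  \
        fflush(stderr);                \
    } while (0)

/* Hypothetical helper mirroring the test's log line. */
static void sketch_report_case(const char *str, unsigned lb, unsigned ub) {
    assert(str != NULL);
    SKETCH_TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u\n", str, lb, ub);
}
```

Wrapping the body in do { ... } while (0) keeps the macro safe to use as a single statement inside an unbraced if/else.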
#undef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
- #undef MIN
- #define MIN(a, b) (((a) < (b)) ? (a) : (b))
Suggest using BSON_MIN to simplify.
Ah, totally forgot to do this in the test; changed it in the source file due to Erwin's comment earlier. Done.
@@ -119,6 +119,58 @@ bool mc_FLE2RangeInsertSpec_parse(mc_FLE2RangeInsertSpec_t *out,
                                  bool use_range_v2,
                                  mongocrypt_status_t *status);

typedef struct {
    // mlen is the max string length that can be indexed.
- // mlen is the max string length that can be indexed.
+ // mlen is the max string length (in characters, not bytes) that can be indexed.
Added a clarifying comment above.
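To illustrate why the distinction matters, here is a minimal sketch that counts UTF-8 code points separately from bytes. It assumes "characters" means Unicode code points and that the input is already valid UTF-8; the function name is hypothetical and is not part of the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code points by counting every byte that is NOT a UTF-8
 * continuation byte (bit pattern 10xxxxxx). No validation is done. */
static uint32_t sketch_count_codepoints(const char *s, uint32_t byte_len) {
    assert(s != NULL);
    uint32_t count = 0;
    for (uint32_t i = 0; i < byte_len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the 6-byte string "h\xC3\xA9llo" this returns 5, so a limit expressed in characters admits strings whose byte length is larger than the limit.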
mc_FLE2TextSearchInsertSpec_t spec =
    {str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
- mc_FLE2TextSearchInsertSpec_t spec =
-     {str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
+ mc_FLE2TextSearchInsertSpec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};
Suggest using designated initializers and omitting fields that are expected to be zero-initialized to improve readability.
Done. Thanks for the feedback.
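The suggestion relies on the C99 rule that members omitted from a designated initializer are zero-initialized. A self-contained sketch follows; the struct layout is a simplified guess at the insert spec, and every sketch_ name is hypothetical (only .v, .len, and .suffix appear in the review).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified mirror of the insert spec. */
typedef struct {
    uint32_t lb;
    uint32_t ub;
} sketch_affix_bounds_t;

typedef struct {
    sketch_affix_bounds_t value;
    bool set;
} sketch_opt_affix_t;

typedef struct {
    const char *v;
    uint32_t len;
    sketch_opt_affix_t suffix;
    sketch_opt_affix_t prefix;
    bool casef;
    bool diacf;
} sketch_insert_spec_t;

static sketch_insert_spec_t sketch_make_suffix_spec(const char *str, uint32_t byte_len, uint32_t lb, uint32_t ub) {
    assert(str != NULL);
    /* prefix, casef, and diacf are not named, so they are zero/false. */
    sketch_insert_spec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};
    return spec;
}
```

Naming only the fields a test cares about also keeps the initializer valid if fields are later added or reordered in the struct.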
uint32_t affix_count = 0;
uint32_t total_real_affix_count = 0;
while (mc_affix_set_iter_next(&it, &affix, &affix_len, &affix_count)) {
    // Since all substrings are just views on the base string, we can use pointer math to find our start and
- // Since all substrings are just views on the base string, we can use pointer math to find our start and
+ // Since all substrings are just views on the base string, we can use pointer math to find our start and end
Done
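The pointer math mentioned in that comment can be sketched as follows. This is an illustration only: the function name and the half-open [start, end) convention are assumptions, and it requires that the view really points into the base buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recovers the byte offsets of a view within its base buffer.
 * Precondition: view points into base[0..base_len]. */
static void sketch_view_offsets(const char *base,
                                uint32_t base_len,
                                const char *view,
                                uint32_t view_len,
                                uint32_t *start,
                                uint32_t *end) {
    assert(base != NULL && view != NULL && start != NULL && end != NULL);
    ptrdiff_t off = view - base;
    assert(off >= 0 && (uint64_t)off <= base_len);
    assert(view_len <= base_len - (uint32_t)off);
    *start = (uint32_t)off;
    *end = *start + view_len; /* exclusive end offset */
}
```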
ASSERT(sets->exact.len == byte_len);
ASSERT(0 == memcmp(sets->exact.data, str, byte_len));

if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
- if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
+ if (lb > max_padded_len) {
Redundant with above check.
Good catch.
}
set->start_indices[idx] = base_start_idx;
set->end_indices[idx] = base_end_idx;
set->substring_counts[idx] = 1;
Consider storing and incrementing the current set size in mc_affix_set_t, rather than requiring callers to track the index:
set->start_indices[set->cur_idx] = base_start_idx;
set->end_indices[set->cur_idx] = base_end_idx;
set->substring_counts[set->cur_idx] = 1;
set->cur_idx++;
That may help avoid exposing implementation details of mc_affix_set_t to the caller.
Changed, thanks for the suggestion.
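One way to take the suggestion a step further is to hide the index behind an insert function, as in the sketch below. The struct is a hypothetical stand-in for mc_affix_set_t: only the three arrays and cur_idx come from the review, and the capacity field and function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t *start_indices;
    uint32_t *end_indices;
    uint32_t *substring_counts;
    uint32_t n_indices; /* capacity, fixed at init (assumed field) */
    uint32_t cur_idx;   /* next free slot, owned by the set */
} sketch_affix_set_t;

static bool sketch_affix_set_init(sketch_affix_set_t *set, uint32_t n_indices) {
    assert(set != NULL && n_indices > 0);
    set->start_indices = calloc(n_indices, sizeof(uint32_t));
    set->end_indices = calloc(n_indices, sizeof(uint32_t));
    set->substring_counts = calloc(n_indices, sizeof(uint32_t));
    set->n_indices = n_indices;
    set->cur_idx = 0;
    return set->start_indices && set->end_indices && set->substring_counts;
}

/* Appends one entry; returns false instead of writing past capacity. */
static bool sketch_affix_set_insert(sketch_affix_set_t *set, uint32_t start, uint32_t end, uint32_t count) {
    assert(set != NULL);
    if (set->cur_idx >= set->n_indices) {
        return false;
    }
    set->start_indices[set->cur_idx] = start;
    set->end_indices[set->cur_idx] = end;
    set->substring_counts[set->cur_idx] = count;
    set->cur_idx++;
    return true;
}

static void sketch_affix_set_cleanup(sketch_affix_set_t *set) {
    assert(set != NULL);
    free(set->start_indices);
    free(set->end_indices);
    free(set->substring_counts);
}
```

With the index owned by the set, a caller cannot skip or overwrite a slot by passing a stale idx, and the bounds check lives in one place.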
src/mc-str-encode-string-sets.c
Outdated
    it->cur_idx = 0;
}

bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
- bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
+ bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *byte_len, uint32_t *count) {
To clarify output is byte length, not character length.
Changed
src/mc-str-encode-string-sets.c
Outdated
// Linked list node in the hashset.
typedef struct _mc_substring_set_node_t {
    uint32_t start_offset;
    uint32_t len;
- uint32_t len;
+ uint32_t byte_len;
Changed
mc_utf8_string_with_bad_char_t *mc_utf8_string_with_bad_char_from_buffer(const char *buf, uint32_t len) {
    BSON_ASSERT_PARAM(buf);
    mc_utf8_string_with_bad_char_t *ret = bson_malloc0(sizeof(mc_utf8_string_with_bad_char_t));
    _mongocrypt_buffer_init_size(&ret->buf, len + 1);
I expect this could overflow if len is UINT32_MAX. Similarly, the CBC length calculations may overflow when adding 15.
I suggest rejecting too-long strings in mc_text_search_str_encode, and using a limit much smaller than UINT32_MAX. If the limit is near UINT32_MAX, these operations may be too slow to be useful and could risk a denial-of-service attack. Consider using 16MiB (16777216 bytes) to match the maximum insert size of a BSON document (maxBsonObjectSize from the hello command reply).
I was originally thinking that we would perform this check at a higher level, but I think now I agree with you that it makes sense to do the checks here, nearer to where the actual algorithm is taking place. One thing to note is that for the substring case, even getting close to 16MiB is going to be way too big. Figuring out the max values for all the parameters that go in is still TBD for later in the project, so I'll put the limit at 16MiB for now and we can think more about this later.
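A length guard along the lines discussed here could look like the sketch below. The constant and function names are hypothetical; the 16MiB figure is the one proposed in the review, and the final limit is still to be decided per the reply above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit name: 16 MiB = 16777216 bytes. */
#define SKETCH_MAX_ENCODE_BYTE_LEN (16u * 1024u * 1024u)

/* Rejects over-long inputs before any length arithmetic. On success,
 * *alloc_len is byte_len + 1, which cannot wrap because byte_len is
 * far below UINT32_MAX. */
static bool sketch_check_encode_len(uint32_t byte_len, uint32_t *alloc_len) {
    assert(alloc_len != NULL);
    if (byte_len > SKETCH_MAX_ENCODE_BYTE_LEN) {
        return false;
    }
    *alloc_len = byte_len + 1u;
    return true;
}
```

Without the guard, len + 1 with len equal to UINT32_MAX wraps to 0 in uint32_t arithmetic, so the buffer would be allocated with size 0 while later code still treats len bytes as available.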